Literature Synthesis on Curriculum-Based 
Measurement in Reading 
University of Minnesota 


In this article, the authors review the research on curriculum-based measurement (CBM) in reading 
published since the time of Marston’s 1989 review. They focus on the technical adequacy of CBM re- 
lated to measures, materials, and representation of growth. The authors conclude by discussing issues 
to be addressed in future research, and they raise the possibility of the development of a seamless and 
flexible system of progress monitoring that can be used to monitor students’ progress across students, 
settings, and purposes. 


Curriculum-based measurement (CBM) is a method for mon- 
itoring student growth in an academic area and evaluating the 
effects of instructional programs on that growth (Deno, 1985). 
CBM was designed to be part of a problem-solving approach 
to special education whereby the academic difficulties of stu- 
dents would be viewed as problems to be solved rather than 
as immutable characteristics within a child (Deno, 1990). In the 
problem-solving approach, teachers were the “problem solvers” 
who constantly evaluated and modified students’ instructional 
programs. For a problem-solving approach to be effective, it 
was necessary for teachers to have a tool that could be used 
to evaluate growth in response to instruction. CBM was de- 
veloped to serve that purpose. 

Two separate but related concerns drove the initial re- 
search into the development of CBM (Deno, 1985). The first 
was the concern for technical adequacy. If teachers were to use 
the measures to make instructional decisions, the measures 
would have to have demonstrated reliability and validity. The 
second was the concern for practicality. If teachers were to 
use the measures on an ongoing and frequent basis to evalu- 
ate instructional programs, the measures would have to be 
simple, efficient, easily understood, and inexpensive. These dual 
concerns led to the concept of “vital signs,” or indicators of 
student performance (Deno, 1985). CBM measures were con- 
ceptualized to be short samples of work that would be indi- 
cators, or vital signs, of academic performance. The samples 
would need to be valid and reliable with respect to the broader 
academic domain they were representing, but would also need 
to be designed to be given on a frequent and repeated basis. 

In 1989, Marston reviewed the existing research on CBM. 
At that time, CBM was viewed primarily as a progress-monitoring 
tool in basic skills for special education students 
at the elementary-school level (although broader uses had been 
discussed and demonstrated; e.g., see Shinn, 1989). Research in 
reading focused on two measures: word identification and reading 
aloud. The results of Marston’s review provided support for the 
use of these two measures as 
indicators of general reading proficiency. In terms of reliabil- 
ity, results of five studies revealed test-retest reliability coef- 
ficients ranging from .82 to .97, with most coefficients above 
.90, and alternate-form reliability coefficients ranging from 
.84 to .96, with most coefficients above .90. Interrater agree- 
ment was .99. In terms of validity, 14 studies were reviewed. 
Criterion-related validity coefficients with published mea- 
sures of reading ranged from .63 to .90, with most above .80. 
Criterion-related validity coefficients with basal reading se- 
ries criterion mastery tests ranged from .57 to .86, with half 
above .80. Reading aloud correlated with teacher judgment 
and with various measures of reading comprehension, dis- 
criminated between lower and higher performing students, 
and was sensitive to growth. 

Since the time of Marston’s (1989) review, the research 
on CBM has expanded considerably — especially in the area 
of reading — making an updated review timely. Our purpose 
in writing this review is to gather, summarize, and reflect on 
the expansive body of literature published over the last 18 
years on CBM in reading. We focus this review on issues of 
technical adequacy as they relate to measures, materials, and 
growth. Given the vast amount of material and the diversity 
of topics covered in this review, we insert summaries and dis- 
cussion points throughout the article. In our final section, we 
draw conclusions, raise issues related to future research, and 
discuss the potential development of a seamless and flexible 
system of progress monitoring that can be used across stu- 
dents, settings, and purposes. 

Method 

The first step in our review process was to identify all articles 
addressing CBM in reading, writing, and math in kindergarten 
(K) to Grade 12. Electronic databases, including ERIC, 
Science Citation Index Expanded, PsycINFO, Digital Dissertation, 
and the Expanded Academic Index, were searched 
using the following terms: curriculum based measurement, 
curriculum-based measurement, curriculum based measure, 
curriculum-based measure, general outcome measure, and 
progress monitoring. This initial search yielded 578 articles, 
dissertations, and reports related to CBM. Titles and abstracts 
of these documents were screened to confirm that they were 
related to CBM, and Method sections were screened to iden- 
tify those that reported results of empirical studies of CBM, 
yielding 160 documents. These documents were then reviewed 
by a team of educational psychology graduate students and 
grouped by subject area (reading, mathematics, spelling, and 
writing). Ninety (56%) of the documents addressed reading 
measures. In addition to documents identified through the lit- 
erature search, the complete set of technical reports produced 
by the Institute for Research on Learning Disabilities (IRLD) 
at the University of Minnesota was accessed. 

Given the vast database on reading, and the limitations 
of space accorded a review article, we chose to narrow our 
review in several key ways. First, we focused on research pub- 
lished since the time of Marston’s 1989 review. Second, we fo- 
cused on studies related to the technical adequacy of reading 
measures. Studies in which teachers used CBM to monitor 
progress and to make instructional decisions have been reviewed 
recently (Stecker, Fuchs, & Fuchs, 2005). Third, we focused 
on research conducted with school-age students and thus did 
not include research on the development of early literacy (e.g., 
Dynamic Indicators of Basic Early Literacy Skills [DIBELS]; 
Kaminski & Good, 1998) or preliteracy measures (e.g., Individual 
Growth and Development Indicators; McConnell, McEvoy, 
& Priest, 2002). Finally, we focused on three common CBM 
reading measures: reading aloud, maze selection, and word 
identification (see Note 1). Before reporting the results of our 
review, we present a brief discussion of the issue of validity, 
relating it specifically to the approach taken to investigate the 
validity of measures in CBM. 

Issues of Validity 

Messick (1989a, 1989b) described construct validity as a mul- 
tifaceted, but unified, concept that takes into account the ev- 
idential and consequential factors related to test interpretation 
and test use. This conceptualization is helpful in understand- 
ing the research on CBM, which has examined the evidence 
supporting the interpretation and use of CBM scores, as well 
as the consequences associated with that interpretation and 
use. In this review, we focus on the evidential basis for valid- 
ity of CBM measures, although, in keeping with the view of 
validity as a unified concept, we continuously consider the po- 
tential consequences associated with interpretation and use of 
the measures. 

In a CBM approach, evidence for the validity of a mea- 
sure is determined by examining the extent to which the 
measure serves as a vital sign or an indicator of a broader academic 
domain (Deno, 1985). To determine the evidential basis 
for the validity of a measure, the pattern of relations — also re- 
ferred to as the nomological net (see Cronbach & Meehl, 1955; 
Messick, 1989b) — between the selected measure and many 
different criterion measures, each reflecting the construct of 
interest, is examined. Criterion measures might include other 
measures of the construct, such as standardized achievement 
tests, but might also include student age, group membership, 
and change in performance in response to an intervention. It 
is not just the pattern of relations but also the pattern of non- 
relations, or the discriminative validity of the measure, that is 
considered. For example, a measure of reading would be ex- 
pected to relate more closely to another reading measure than 
to a math measure. After a pattern of relations is established 
for particular students, settings, or purposes, the generaliz- 
ability (Messick, 1989b) of the measure, or the extent to which 
the validity of the measure holds across different settings, stu- 
dents, and purposes, is examined. 

Establishing the validity of a measure is an ongoing and 
recursive process (Messick, 1989b). Validity is determined not 
by one study or by one correlation but by the body of evidence 
amassed over time. The question arises, then, as to how one 
determines when the data are strong enough to support the use 
of the measure for the purpose for which it was intended. In 
other words, how good is good enough? There is no set stan- 
dard for determining when a measure is “good enough” to be 
considered valid for the purpose for which it was designed. 
One approach is to consider the consequences of decision 
making with and without the measure (Messick, 1989b). For 
example, one might ask whether using a measure improves 
the ability to make decisions over using no measure at all. In 
an area in which few measures exist, this question might be 
warranted. Under such circumstances, correlations of .30 
between the selected measure and the criterion measures 
might be considered strong enough to warrant use of an in- 
strument for decision-making purposes. However, in the area 
of reading, where many measures exist, it seems appropriate 
to apply more stringent criteria. In addition, the early research 
on CBM, as illustrated in Marston’s (1989) review, set a rel- 
atively high standard for validity and reliability, with many 
correlations reaching levels of .70 to .90. Thus, in this article, 
we adopt the following guidelines in interpreting the strength 
of the reliability and validity coefficients: Strong relations are 
those that are .70 and above; moderate relations are those that 
are .50 to .70; and weak relations are those that are below .50. 
We remind the reader, however, that these levels are arbitrar- 
ily chosen and used merely to help the reader interpret the 
strength of relations compared to previous research in read- 
ing in CBM. To evaluate the overall validity of the measures, 
it is necessary to consider the entire body of research. 
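
The interpretive guidelines above amount to a simple thresholding rule. A minimal Python sketch follows; the function name `classify_strength` and the decision to ignore the sign of the coefficient are illustrative assumptions, not part of the review.

```python
def classify_strength(r: float) -> str:
    """Label a reliability or validity coefficient with the review's cutoffs.

    Cutoffs from the text: strong = .70 and above, moderate = .50 to .70,
    weak = below .50. Using the absolute value is an assumption made here
    so that negative coefficients are judged by magnitude.
    """
    magnitude = abs(r)
    if magnitude >= 0.70:
        return "strong"
    if magnitude >= 0.50:
        return "moderate"
    return "weak"


if __name__ == "__main__":
    # Coefficients of the kind reported in the studies reviewed here.
    for r in (0.91, 0.63, 0.35):
        print(r, classify_strength(r))
```

Because the cutoffs are described as arbitrarily chosen, such labels are best read as a convenience for comparing studies rather than as a test of validity.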

Our review is organized into three sections: Technical 
Adequacy of CBM Measures, Effects of Text Materials, and 
Issues About Measuring Growth. The studies described in 
each section are also outlined in Table 1. We begin our review 
by examining evidence related to the technical adequacy of 
the various measures used in CBM reading and consider 
generalizability of the research to other students and purposes. 

Table 1 (partial) 

Yovanoff, Duesbery, Alonzo, & Tindal (2005) 6,012 4-8 ND Reading aloud 1 WRC 
4 Vocabulary: .35-.63 
Comprehension: .60-.65 

Paid Lunch: .75 
Free/Reduced Lunch: .66 
Reading aloud, lunch status, and race each 
significantly contributed to predicting performance 
on MEAP and MAT-7 

English Language Fluency: .44 
Teacher rating of English: .62 
Language Assessment Scale-English: .47 

Regular Lunch: .70 
Home Language Spanish: .65 
Home Language English: .63 
Intercept bias evident when ethnicity and 

Hintze & Silberglitt (2005) 1,766 1-3 ND Reading aloud 1 WRC Minnesota Comprehensive Assessment (MCA) Alternate form 
1 Predictive: .49-.58 .89-.91 
2 Predictive: .61-.68 .80-.85 
3 Concurrent: .69 .83-.87 
Results of discriminative analysis, logistic 
regression, and Receiver Operating Characteristic 
(ROC) curves indicated that reading aloud 
is an efficient method for predicting success on 

Technical Adequacy of CBM Measures 

CBM Reading Measures 

We focus our review on the three measures most commonly 
used in CBM reading: reading aloud, maze selection, and 
word identification. In reading aloud, students read aloud 
from a passage, usually for 1 min, and the number of words 
read correctly is scored (Deno, 1985). Omissions, insertions, 
substitutions, hesitations, and mispronunciations are marked 
as errors. In maze selection, students read through a passage 
in which (typically) every seventh word has been deleted and 
replaced with three word choices — one correct choice and two 
distracters. (Rules for creating maze passages can be found in 
D. Fuchs and L. S. Fuchs, 1992.) Students read the passage 
silently, usually for 1 to 3 min, making selections as they read. 
The number of correct selections is scored. In word identi- 
fication, students read aloud from a list of high-frequency 
words, usually for 1 min, and the number of words read cor- 
rectly is scored (Deno, Mirkin, & Chiang, 1982). Omissions, 
insertions, substitutions, hesitations, and mispronunciations 
are marked as errors. Words are usually selected from word 
lists or from the reading curriculum. By far, the majority of 
research has focused on the reading-aloud measure. Recently, 
however, interest in maze selection and word identification 
has grown as CBM has been extended to younger and older 
students and as computerized progress monitoring has be- 
come a possibility. 
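
To make the maze format concrete, the sketch below replaces every seventh word with three choices and scores the number of correct selections. It is only an illustration: the function names, the distracter pool, and the random choice of distracters are assumptions, and the sketch does not reproduce the passage-construction rules cited above.

```python
import random


def make_maze(passage: str, distracter_pool: list[str],
              every: int = 7, seed: int = 0) -> list[dict]:
    """Build maze items from a passage (hypothetical helper).

    Every `every`-th word becomes an item with one correct choice and two
    distracters drawn at random from `distracter_pool`. Words are split on
    whitespace, so punctuation stays attached to the word (a simplification).
    The pool is assumed to hold at least two words other than the answer.
    """
    rng = random.Random(seed)
    items = []
    for position, word in enumerate(passage.split(), start=1):
        if position % every == 0:
            candidates = [w for w in distracter_pool
                          if w.lower() != word.lower()]
            choices = rng.sample(candidates, 2) + [word]
            rng.shuffle(choices)
            items.append({"position": position,
                          "answer": word,
                          "choices": choices})
    return items


def score_maze(items: list[dict], selections: list[str]) -> int:
    """Return the number of correct selections, the maze score."""
    return sum(1 for item, pick in zip(items, selections)
               if pick == item["answer"])


if __name__ == "__main__":
    text = ("the small dog ran quickly across the wide green field "
            "and then it stopped")
    maze = make_maze(text, ["apple", "river", "chair", "window"])
    for item in maze:
        print(item["position"], item["choices"])
    print(score_maze(maze, [item["answer"] for item in maze]))
```

A timed administration would additionally stop scoring at the last item reached within the time limit, which this sketch leaves out.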

Reading Aloud 

Despite the early support for the technical adequacy of the 
reading-aloud measures, practitioners and researchers alike 
continued to express doubts about the relation between the 
simple measure of reading aloud from text for 1 minute and 
reading proficiency, especially proficiency in reading com- 
prehension (e.g., see Mehrens & Clarizio, 1993; Yell, 1992). 
Thus, in the 1980s and 1990s, efforts were made to more closely 
examine the nature of the relationship between reading aloud 
and general reading proficiency, especially reading compre- 
hension. This research took two different approaches. The 
first approach sought to clarify the relation between reading 
aloud and reading comprehension by considering alternative 
measures that might be more closely linked to reading com- 
prehension and by examining the theoretical underpinnings of 
the relation between reading aloud and reading proficiency. 
The second approach sought to examine the concomitant re- 
lation between CBM reading aloud and reading comprehen- 
sion, with a focus on the individual student. 


Clarification of the Relation Between Reading Aloud 
and Reading Comprehension. Fuchs, Fuchs, and Maxwell 
(1988) compared the validity of CBM reading-aloud measures 
to that of other measures typically used to assess reading com- 
prehension, including cloze (where every seventh word is de- 
leted from a text and replaced with a blank), story retell, and 
question-answering measures. Participants were students with 
mild disabilities in Grades 4 to 8. Results revealed that reading- 
aloud scores correlated more strongly with scores on the com- 
prehension and word skills subtests of a standardized achieve- 
ment test (r = .91 and r = .80, respectively) than did scores 
from the other “typical” comprehension measures (rs = .76 to 
.82 for the reading comprehension and .66 to .76 for the word 
skills subtests, respectively). Results of the Fuchs et al. study 
suggested that reading aloud was more than just a measure of 
fluent decoding, a notion that was supported in subsequent re- 
search investigating the theoretical nature of the relationship 
between reading aloud and reading comprehension. 

Shinn, Good, Knutson, Tilly, and Collins (1992) used 
confirmatory factor analysis to examine the role of reading 
aloud as it related to decoding, fluency, and comprehension 
skills for students in Grades 3 and 5. A single-factor model of 
“reading competence” was validated for third graders, with 
all reading skills making significant contributions. In con- 
trast, a two-factor model including decoding and comprehen- 
sion as two separate but highly related factors was validated 
for fifth graders, with reading aloud loading on the decoding 
factor. Hosp and Fuchs (2005) also observed changes in the 
nature of the relationship between reading aloud and reading 
proficiency associated with age. Relationships between CBM 
reading aloud and the Decoding, Word Reading, and Comprehension 
subtests of the Woodcock Reading Mastery Test-Revised 
(WRMT-R; Woodcock, 1987) were similar in magnitude 
for students in Grades 2 and 3 (ranging from .82 to .88), but 
in Grade 4, lower correlations were observed for the Decod- 
ing and Word Reading subtests (rs = .72 and .73, respectively) 
than for the Reading Comprehension subtest (r = .82). 

Kranzler, Brownell, and Miller (1998), in a somewhat 
different approach, posed the hypothesis that the number of 
words read aloud from text in 1 min might merely be a re- 
flection of general speed of processing. Kranzler et al. exam- 
ined the roles of general cognitive ability, speed and efficiency 
of elemental cognitive processing, and reading aloud in the 
prediction of reading comprehension for students in Grade 4. 
Multiple regression analyses revealed a significant relation- 
ship between reading aloud and reading comprehension mea- 
sures that could not be explained by general cognitive ability 
or speed and efficiency of elemental cognitive processing. Re- 
sults suggested that reading aloud was not merely an indicator 
of general cognitive processing speed. Kranzler et al. noted, 
however, that although reading aloud had the highest stan- 
dardized regression coefficient when compared to cognitive 
ability and mental speed, it explained only 11% of the unique 
variance found in reading comprehension. 

Concomitant Change in Reading Aloud and Reading 
Comprehension. The studies reviewed in the previous sec- 
tion focused on patterns of results across groups; however, the 
concerns raised by practitioners often focus on the nature of 
the relationship between CBM and reading comprehension for 
the individual student. One such concern is whether reading 
aloud and reading comprehension change concomitantly. Markell 
and Deno (1997) addressed this issue by experimentally 
manipulating the difficulty level of reading material. Students 
in Grade 3 read passages that were two levels below, at, and 
two levels above grade level. Students also completed two 
comprehension tasks for each passage — question answering 
and maze selection. Results revealed that, on average, students 
read significantly fewer words in 1 min on the more difficult 
passages, answered fewer questions correctly, and selected fewer 
correct maze choices, supporting the general relation between 
words read aloud in 1 min and reading comprehension. How- 
ever, at the individual level, it appeared that the amount of per- 
formance change was an important factor to be considered. 
Results revealed that for only 52% of the students did the rank 
ordering of reading-aloud scores match the rank ordering of 
comprehension scores on all three levels of materials. Taking 
a more liberal approach, and controlling for a ceiling effect 
on the comprehension measures, the percentage of students 
for whom rankings matched on only the highest and lowest 
levels was examined. This analysis resulted in an agreement 
of 100% and 96% for question answering and maze, respec- 
tively. These results suggested that a relatively large change 
in the number of words read in 1 min (perhaps as large as 
15-20 words) was needed to predict with certainty a con- 
comitant change in reading comprehension. 
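
The rank-order comparison described above can be illustrated with a short Python sketch. This is one plausible reading of the analysis, not the procedure reported in the study; the function names, the tie handling, and the example scores are hypothetical.

```python
def rank_order(scores: list[float]) -> list[int]:
    """Return the passage indices ordered from lowest to highest score.

    Ties are broken by passage index, an assumption made for simplicity.
    """
    return sorted(range(len(scores)), key=lambda i: scores[i])


def percent_matching(students: list[tuple[list[float], list[float]]]) -> float:
    """Percentage of students whose reading-aloud and comprehension
    scores put the passages in the same rank order."""
    matches = sum(1 for aloud, comprehension in students
                  if rank_order(aloud) == rank_order(comprehension))
    return 100.0 * matches / len(students)


if __name__ == "__main__":
    # Each student: (words read correctly, comprehension score) on the
    # below-, at-, and above-grade-level passages. Values are made up.
    students = [
        ([60, 45, 30], [8, 6, 3]),   # same ordering on both measures
        ([60, 45, 30], [6, 8, 3]),   # orderings differ
    ]
    print(percent_matching(students))
```

Restricting each score list to the easiest and hardest passages gives the more liberal comparison described in the text.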

A second concern raised by practitioners is the existence 
of “word callers,” that is, students who can read fluently but 
do not comprehend. Hamilton and Shinn (2003) examined 
teachers’ ability to identify word callers. Third-grade teach- 
ers were asked to identify one to two students who were word 
callers (WC) and one to two similarly fluent peers (SFP). Sim- 
ilarly fluent peers were students whom teachers judged as 
having fluency rates similar to the WC students but with 
higher levels of comprehension. Results confirmed differ- 
ences in comprehension levels between the students in the WC 
and SFP groups, with SFP students scoring higher on com- 
prehension tasks. However, results also revealed differences 
in reading fluency, with scores for WC students lower than 
those for SFP students, calling into question teachers’ ability 
to identify students as word callers. We note that the Hamil- 
ton and Shinn (2003) study did not address the actual exis- 
tence of word callers, just teachers’ judgment of word callers. 
To examine the existence of word callers, one would need to 
examine the relative standing of students on reading aloud and 
reading comprehension measures. Word callers would be 
those whose relative standing on the reading aloud measures 
was substantially higher than their standing on the reading 
comprehension measures. 
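
One way to operationalize "relative standing" is to convert both sets of scores to z scores within the group and flag students whose reading-aloud z score exceeds their comprehension z score by some margin. The Python sketch below does this; the function names, the use of z scores, and the 1.0 cutoff are illustrative assumptions rather than criteria proposed in the literature reviewed here.

```python
from statistics import mean, pstdev


def z_scores(values: list[float]) -> list[float]:
    """Standardize scores within the group (assumes the scores vary)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def flag_word_callers(aloud: list[float], comprehension: list[float],
                      gap: float = 1.0) -> list[int]:
    """Return indices of students whose standing on reading aloud is at
    least `gap` standard deviations above their standing on comprehension.
    The default gap of 1.0 is a hypothetical cutoff."""
    z_aloud = z_scores(aloud)
    z_comp = z_scores(comprehension)
    return [i for i, (a, c) in enumerate(zip(z_aloud, z_comp))
            if a - c >= gap]


if __name__ == "__main__":
    aloud = [100, 100, 60, 60]        # words read correctly (made up)
    comprehension = [10, 4, 4, 10]    # comprehension scores (made up)
    print(flag_word_callers(aloud, comprehension))
```

Any such cutoff would itself need validation before the resulting classifications could be interpreted.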


Maze Selection 

Although the number of words read aloud in 1 min demon- 
strated good technical adequacy in the early IRLD research, 
the measure had limitations: It had to be administered individually, 
it lacked face validity, and its appropriateness for older students, 
who might presumably reach an asymptote in reading-aloud 
performance, was unclear. These factors, combined with improvements in 
technology and changes in the field of special education lead- 
ing to larger caseloads, led to consideration of the maze mea- 
sure. The maze could be administered in groups, appeared to 
be more of a reading comprehension measure than a reading 
aloud measure, could be administered via the computer, and 
was considered to be more acceptable for older students. The 
maze was not a new measure. An untimed version of the maze 
had been studied in the 1970s by Guthrie as a measure of reading 
comprehension and was shown to have good stability, to 
correlate with standardized measures of reading proficiency, 
and to separate readers with and without disabilities (Guthrie, 
1973; Guthrie, Seifert, Burnham, & Caplan, 1974). The use 
of the maze as a timed measure within a CBM framework did 
not appear until the late 1980s and early 1990s. 

In 1989, Espin, Deno, Maruyama, and Cohen reported 
on the technical adequacy of a maze measure that was part of 
a group-administered screening instrument called the Basic 
Academic Skills Samples (BASS; Deno, Maruyama, Espin, & 
Cohen, 1989). The reading portion of the BASS consisted of 
three 1-min maze selection tasks that were approximately at 
a first- to second-grade reading level. The BASS was admin- 
istered to more than 2,000 students in Grades 1 through 6 
across 31 schools. Correlations between the BASS maze and 
1-min reading-aloud passages for a random sample of stu- 
dents from Grades 3, 4, and 5 were .77, .86, and .86, respec- 
tively. Data from the entire sample revealed a stable pattern 
of increase in maze scores from Grades 1 to 6, as well as from 
winter to spring within each grade. 

Fuchs and Fuchs (1992) extended the research on maze 
selection in their search for a CBM reading measure that would 
be suitable for data collection via the computer and that might 
have greater acceptance for teachers than would reading aloud. 
Technical adequacy and level of teacher acceptance were 
compared for several alternative CBM measures, including 
question answering, story recall, cloze, and maze selection. 
Maze selection in this study was a 2.5-min measure adminis- 
tered twice weekly for 18 weeks via computer. Earlier re- 
search (Fuchs & Fuchs, 1990) had revealed correlations of .83 
between scores on maze and reading aloud and of .77 between 
scores on maze and the Reading Comprehension subtest of 
the Stanford Achievement Test (SAT; Gardner, Rudman, Karl- 
sen, & Merwin, 1982). Results of the Fuchs and Fuchs (1992) 
study revealed that the maze task was sensitive to change in 
performance over time and, unlike other measures, had a rel- 
atively small ratio of slope to standard error of estimate (SEE), 
making it easier to detect growth on a graph. In addition, 
teachers rated their satisfaction with maze highly, reporting 
that they believed the maze reflected multiple dimensions of 
reading, including decoding, comprehension, and fluency. Fi- 
nally, students reported that they liked taking the maze. 
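
The slope and standard error of estimate (SEE) mentioned above can be obtained by fitting an ordinary least squares line to a student's repeated scores. The following Python sketch shows the arithmetic; the function name and the example data are hypothetical, and the study's own computations may have differed in detail.

```python
from math import sqrt


def slope_and_see(weeks: list[float], scores: list[float]) -> tuple[float, float]:
    """Fit score = intercept + slope * week by ordinary least squares.

    Returns (slope, SEE), where SEE is the square root of the residual
    sum of squares divided by n - 2. Requires at least three data points.
    """
    n = len(weeks)
    mean_x = sum(weeks) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2
                      for x, y in zip(weeks, scores))
    see = sqrt(residual_ss / (n - 2))
    return slope, see


if __name__ == "__main__":
    # Four weekly maze scores (made-up values).
    slope, see = slope_and_see([1, 2, 3, 4], [10, 13, 12, 15])
    print(f"slope = {slope:.2f} per week, SEE = {see:.2f}")
```

The slope describes the average weekly gain, and the SEE describes how far individual data points scatter around the trend line.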

In a direct comparison of the technical adequacy of 
reading aloud and maze selection measures, Jenkins and Jew- 
ell (1993) examined the validity of the two measures across 
Grades 2 to 6. All students in the study completed three 1-min 
maze tasks and three 1-min reading-aloud passages. Passages 
were at a first- to second-grade level. Criterion variables were 
scores on the Gates-MacGinitie Reading Tests (Gates; MacGin- 
itie, Kamons, Kowalski, MacGinitie, & McKay, 1978) and the 
Metropolitan Achievement Tests (MAT; Prescott, Balow, Ho- 
gan, & Farr, 1984). Within-grade correlations were moderate- 
strong to strong for both measures, ranging from .63 to .88 
with the Gates and from .58 to .87 with the MAT. In Grades 2 
through 4, correlations tended to be stronger for reading aloud 
than for maze, but in Grades 5 and 6, this pattern of differ- 
ences disappeared. Looking across grades, correlations be- 
tween the reading aloud and the criterion measures dropped 
from the .80s in Grades 2 through 4 to the .60s to .70s in Grades 
5 and 6. In contrast, correlations for the maze remained consistent 
across the grade levels, with most between .65 and .75. 
Finally, both measures revealed increases across Grades 2 to 
6, and from fall to spring within grade. For the reading aloud 
measures, change was greatest from Grades 2 to 3, after which 
it leveled off. Maze, in contrast, reflected more even rates of 
change across the grades. Results suggested that reading 
aloud might be a better measure than maze for primary-grade 
students, a conclusion supported by Ardoin et al. (2004), who 
found that adding a maze task did not add significantly to the 
prediction of performance on a standardized achievement test 
in reading for students in third grade. 

Word Identification 

Although both word identification (word ID) and reading 
aloud were included as potential indicators of reading profi- 
ciency in early CBM research (e.g., Marston, Deno, & Tin- 
dal, 1983; Marston, Lowry, Deno, & Mirkin, 1981; Shinn, 
Ysseldyke, Deno, & Tindal, 1986; Tindal, Marston, Deno, & 
Germann, 1982), reading aloud emerged as the more com- 
monly used measure, perhaps because it appeared to be more 
closely related to the construct of reading than did word ID. 
Reading aloud, however, proved difficult to use with begin- 
ning first-graders because many of these students could not 
read any words from text in the fall, creating a floor effect in 
the measure (e.g., see Bain & Garlock, 1992; Fuchs, Fuchs, 
& Compton, 2004). As interest in early identification and prevention 
grew, interest in CBM word identification reemerged: 
The words presented in a word ID task could be controlled 
for difficulty and were not constrained by the requirement of 
fitting the words into a coherent story. 


The technical adequacy of several potential CBM read- 
ing measures for students in first grade was compared in a 
study by Daly, Wright, Kelly, and Martens (1997). Reading 
measures included word ID, letter reading, letter copying, 
letter-sound production, and letter-sound selection. Word ID 
probes were created using words selected from the Harris-Ja- 
cobson pre-primer word list (Harris & Jacobson, 1972). Re- 
sults revealed that letter reading and word ID produced the 
best technical adequacy data. Test-retest reliabilities for the 
word ID and letter-reading measures were .94 and .87, re- 
spectively, compared to .42 to .65 for the other measures. Con- 
current validity coefficients with the broad reading subtest of 
the Woodcock-Johnson-Revised (Woodcock & Johnson, 1989) 
were .40 and .35, respectively. Predictive validity coefficients 
between the two measures and a passage-reading and word- 
ID task administered 4 months later were .73 (word ID) and 
.71 (letter reading) with passage reading and .71 (word ID) 
and .69 (letter reading) with word ID. Predictive validity co- 
efficients for the other measures ranged from -.09 to .53. 

Fuchs, Fuchs, and Compton (2004) compared the valid- 
ity of a word-ID task and nonsense word fluency (NWF) task 
for first graders at risk in reading. The word-ID task was created 
using words from the Dolch preprimer, primer, and first-grade-level 
lists. The NWF measure was taken from DIBELS 
(Good, Simmons, & Kame’enui, 2001). Criterion measures in- 
cluded the word attack and word identification subtests of the 
Woodcock Reading Mastery Test-Revised (WRMT-R; Wood- 
cock, 1987) and the Comprehensive Reading Assessment Bat- 
tery (CRAB; Fuchs, Fuchs, & Hamlett, 1989). Students were 
tested in the fall and spring of the year on the criterion mea- 
sures and were tested at least once weekly on both CBM mea- 
sures. Alternate-form reliability for the word-ID and NWF 
tasks was .88 and .87, respectively. Validity was examined for 
both level and slope of performance for the two CBM mea- 
sures. In general, correlations with the criterion measures were 
consistently and reliably stronger for word ID than for NWF. 
Concurrent validity coefficients for word-ID level ranged from 
.52 to .93, compared to .50 to .80 for NWF. Predictive valid- 
ity for word-ID level ranged from .45 to .80, compared to .46 
to .64 for NWF. Finally, the slopes produced by word ID were 
more strongly correlated with the criterion variables (with most
coefficients at .45 or above) than were the slopes produced by NWF (with
only 2 of 12 coefficients at .45 or above). Results not only lent 
support for the concurrent and predictive validity of word ID 
as an indicator of performance but also provided information 
supporting the technical adequacy of the growth rates pro- 
duced by the measure. 

In a recent study, Compton, Fuchs, Fuchs, and Bryant 
(2006) examined the use of a word ID measure within a 
response-to-intervention (RTI) approach for first-grade stu- 
dents. Students were administered a prediction battery in the 
fall of first grade consisting of word ID, rapid naming, phone- 
mic awareness, and oral vocabulary measures. Students were 
also progress-monitored for a 5-week period using the word-ID
task, and the slope of improvement was calculated. Students
were followed until the end of second grade. Compton
et al. examined the use of word-ID level, slope, and a combi- 
nation of level and slope as a part of a process for predicting 
performance of participants at the end of second grade. (They 
also examined different classification and data analytic ap- 
proaches, which are not reviewed here.) Results revealed that 
adding word-ID level and slope significantly improved clas- 
sification accuracy for the identification of at-risk students 
over and above the use of phonemic awareness, rapid naming, 
and oral vocabulary measures. 

Generalizability of Research: Student Populations and Purposes

Recent work has examined the validity of CBM reading measures
(primarily reading aloud) for diverse groups of students
and for new uses. In terms of generalizability to new student
populations, CBM research has been extended to students 
at the secondary-school level and to students with diverse 
backgrounds and characteristics (racial/ethnic and language 
backgrounds, gender, socioeconomic status, and sensory dis- 
abilities). In terms of generalizability of uses, CBM research 
has been extended to examine the uses of reading measures 
for predicting performance on state standards tests. Given space
limitations, and the fairly limited amount of research within 
each category, we briefly review these studies in this section. 

Student Populations. Extensions of CBM reading re- 
search to the secondary-school level initially focused on 
the use of reading measures to predict content-area perfor- 
mance, rather than to predict general reading performance 
(e.g., Espin & Deno, 1993a, 1993b; Espin & Deno, 1995; Few- 
ster & MacMillan, 2002; Yovanoff, Duesbery, Alonzo, & Tin- 
dal, 2005). More recent research has focused on predicting 
general reading performance (Espin & Foegen, 1996; Espin, 
Wallace, Lembke, Campbell, & Long, 2007; Muyskens & 
Marston, 2006; Ticha, Espin, & Wayman, 2007). These stud- 
ies have all been conducted at the middle school level. Results 
have generally shown that both reading aloud and maze ex- 
hibit strong alternate-form reliability and moderate to strong 
criterion-related and predictive validity. However, Espin et al. 
(2007) and Ticha et al. (2007) also found that whereas read- 
ing aloud did not reflect change in performance over time, 
maze selection did. Growth on maze was related to perfor- 
mance on a state reading test and to changes on a standard- 
ized achievement test in reading. 

Extensions of CBM reading research to students from 
diverse backgrounds have produced mixed results. With re- 
gard to racial/ethnic backgrounds, Hixson and McGlinchey 
(2004) and Kranzler, Miller, and Jordan (1999) found that 
CBM reading aloud resulted in overestimation of reading per- 
formance for African American students and underestimation 
of performance for Caucasian students. (Overestimation of
performance might result in underidentification for services.)
However, Hintze, Callahan, Matthews, Williams, and Tobin 
(2002) found that CBM reading aloud resulted in neither over- 
nor underestimation of performance for African American or 
Caucasian students. In terms of language, results generally 
have revealed moderate to strong reliability and criterion- 
related validity coefficients for reading-aloud measures with 
English learners (ELs; Baker & Good, 1995; Wiley & Deno, 
2005). In addition, gains on reading-aloud measures for EL 
students have been found to be similar to gains seen for non- 
EL students (Graves, Plasencia-Peinado, Deno, & Johnson, 
2005). 

In a comprehensive study of the effects of home language,
gender, ethnicity, and socioeconomic status on the technical
adequacy of CBM reading-aloud measures, Klein and
Jimerson (2005) followed three cohorts of students (N = 398)
longitudinally from Grades 1 through 3. The first cohort
consisted of Caucasian students who spoke English as their home
language, the second cohort was Hispanic students who spoke
English as their home language, and the third cohort was
Hispanic students who spoke Spanish as their home language.
Results revealed a strong relationship between the CBM 
reading-aloud and SAT reading scores for all three groups of 
students at each grade level, with most correlations between 
.63 and .82. Linear regression analyses revealed that only a 
combination of ethnicity and home language resulted in bias 
in the measures. Specifically, the reading proficiency of His- 
panic students whose home language was Spanish was sys- 
tematically overpredicted (which could lead to systematic 
underidentification for services), whereas the reading profi- 
ciency for Caucasian students whose home language was 
English was systematically underpredicted (which could lead 
to systematic overidentification). 

Only two studies have examined the validity of CBM 
reading measures with students with sensory disabilities. The 
first, Morgan and Bradley-Johnson (1995), provided support 
for the validity of the measures for students in Grades 3 to 7 
with visual impairments. The second, Allinder and Eccarius 
(1999), did not provide support for the validity of either read- 
ing aloud or maze selection for students who were deaf and 
hard of hearing. 

Purposes. Recent research has extended the work on CBM
reading measures to examine their use for predicting perfor- 
mance on state standards tests. Early studies focused on es- 
tablishing benchmark scores that would predict passing or 
failing a state reading test (Crawford, Tindal, & Stieber, 2001; 
Good, Simmons, & Kame’enui, 2001). Subsequent studies that 
examined correlations between CBM reading-aloud measures 
and performance on state standards tests reported diagnostic 
efficiency statistics, including sensitivity, specificity, positive 
predictive power, and negative predictive power (Hintze & 
Silberglitt, 2005; McGlinchey & Hixson, 2004; Silberglitt & 
Hintze, 2005; Stage & Jacobsen, 2001; see Note 2). Sensitivity
is the percentage of students who fail a test whose CBM scores
fell below the cut score. Specificity is the percentage of
students who pass a test whose CBM scores fell above the cut
score. Positive predictive power is the probability
that a student with a score below the cut score will truly fail
the test, whereas negative predictive power is the probability
that a student with a score above the cut score will truly pass
the test.
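
To make the four statistics concrete, the following is a minimal sketch of how they could be computed by cross-classifying students on a CBM cut score and a pass/fail test outcome. It uses the standard 2 x 2 definitions, in which sensitivity and specificity condition on the actual test outcome and the two predictive power values condition on the side of the cut score. The function name, the cut score of 75, and the eight student records are hypothetical and are not taken from any of the studies cited.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticEfficiency:
    sensitivity: float
    specificity: float
    positive_predictive_power: float
    negative_predictive_power: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_efficiency(cbm_scores, passed_test, cut_score):
    """Cross-classify students by CBM cut score and state test outcome."""
    true_pos = false_pos = true_neg = false_neg = 0
    for score, passed in zip(cbm_scores, passed_test):
        flagged = score < cut_score  # below the cut score = predicted to fail
        if flagged and not passed:
            true_pos += 1    # predicted to fail, did fail
        elif flagged and passed:
            false_pos += 1   # predicted to fail, but passed
        elif not flagged and passed:
            true_neg += 1    # predicted to pass, did pass
        else:
            false_neg += 1   # predicted to pass, but failed
    return DiagnosticEfficiency(
        sensitivity=_ratio(true_pos, true_pos + false_neg),
        specificity=_ratio(true_neg, true_neg + false_pos),
        positive_predictive_power=_ratio(true_pos, true_pos + false_pos),
        negative_predictive_power=_ratio(true_neg, true_neg + false_neg),
    )


# Hypothetical data for eight students (not from any cited study).
scores = [40, 55, 62, 70, 85, 90, 95, 110]
passed = [False, False, True, False, True, True, True, True]
result = diagnostic_efficiency(scores, passed, cut_score=75)
print(result)
```

Because the predictive power values depend on the proportion of students who actually fail, they would be expected to shift with the base rate even if sensitivity and specificity stayed the same.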

Correlations between scores on the reading-aloud mea- 
sures and the various state reading tests generally ranged from 
.60 to .80 across studies. The exception to this pattern was the 
study by Stage and Jacobsen (2001), in which correlations be- 
tween reading aloud and the Washington state test were .43 to 
.44. The differences for the Stage and Jacobsen study might 
have been related to the nature of the Washington state read- 
ing test, which required both short-answer and extended writ- 
ten responses, and thus involved not only reading but also 
writing. The other state tests did not require written responses. 

Diagnostic efficiency statistics across the four studies 
were fairly consistent. Sensitivity values ranged from 65% to 
76%; specificity values ranged from 74% to 82%. Positive 
predictive power values ranged from 55% to 77% for all but 
the Stage and Jacobsen (2001) study, in which the value was 
41%. Negative predictive power values ranged from 83% to 
90% for all but the McGlinchey and Hixson (2004) study, in 
which the value was 46%. Across studies, the use of CBM 
added significantly to positive and negative predictive power 
above base rates of prediction. 

Summary and Discussion 

Research on CBM reading measures has provided support for 
and clarified the nature of the relationship between reading 
aloud and general reading proficiency; has examined alternatives
to reading aloud, including maze selection and word ID;
and has examined the generalizability of the results to differ- 
ent student populations and for different uses. 

With regard to the reading-aloud measure, results gen- 
erally replicated earlier research demonstrating a strong rela- 
tionship between CBM reading aloud and reading proficiency, 
even when correlations were calculated within grade, ad- 
dressing a concern raised about the early CBM research (see 
Mehrens & Clarizio, 1993). Reading aloud was found to be a 
better indicator of reading comprehension than were other 
“typical” comprehension measures, and results revealed that 
reading aloud was not just a speed-of-processing measure. In 
addition, research provided insight into the theoretical nature 
of the relationship between reading aloud and reading profi- 
ciency for elementary-school students. 

However, reading aloud has limitations. First, reading 
aloud may not be the best choice for very young and older 
students. For readers at the very beginning stages of reading, 
reading-aloud measures produced a floor effect. An alternative
measure, word ID, proved promising for very beginning
readers. Reliability and validity coefficients
for word ID were consistently strong for beginning
readers, and research supported the use of word ID as a part 
of an RTI approach to early identification and prevention. Sec- 
ond, although correlations between reading-aloud and crite- 
rion measures remained moderate to strong across elementary 
school grades, they were strongest at the primary grades and 
decreased at the intermediate grades. No such decrease was 
seen for maze, which remained fairly stable across the grades. 
Thus, although reading aloud might be the best measure for 
primary-grade students, reading aloud and maze selection both 
seem to be appropriate for intermediate-grade students. For 
secondary-school students, maze may be the best choice. Al- 
though research is limited, initial results suggest that reading 
aloud does not reflect growth for middle school students, 
whereas maze selection does. Finally, if progress is to be mon- 
itored across school years, maze might prove to be the best 
choice. It has been shown to have reasonable validity and re- 
liability for students across Grades 2 through 8, and the growth 
rates across grades have shown greater consistency than 
those for reading aloud. The reasons for the age-related dif- 
ferences seen between the measures are not clear and should 
be examined more closely. Perhaps the teachers from the 
Fuchs and Fuchs (1992) study were correct: Perhaps maze 
reflects multiple aspects of reading proficiency to a greater 
extent than reading aloud does. If true, the separation of read- 
ing proficiency into decoding and comprehension factors at 
the intermediate elementary grades, as seen in Shinn et al. 
(1992), would not affect the maze correlations but would af- 
fect reading-aloud correlations. Regardless of the reasons for 
the differences, given the moderate to strong reliability and 
criterion-related validity coefficients for maze, and given the 
advantages offered by maze in terms of group administration, 
appropriateness for computerized administration, potential 
for cross-grade measurement, and acceptance by teachers, we 
believe that in the future more attention should be devoted to 
maze selection as a potential CBM reading measure. 

The extension of the research to various populations and 
for different uses is still in the early stages of development. 
Research at the secondary-school level is promising but has 
focused primarily on middle school students. Little has been 
done at the high school level. With regard to students of di- 
verse backgrounds, evidence suggests that although the CBM 
reading measures may have reasonable reliability and crite- 
rion-related validity for various groups, the measures may 
overestimate the performance of African American students 
and of Hispanic students whose home language is Spanish and 
underestimate the performance of Caucasian students. How- 
ever, results are mixed, and there is a need to further examine 
these patterns of relations. We suggest that performance-level 
differences between groups be factored out in future research. 

In terms of extensions for new purposes, the use of CBM 
reading-aloud measures for predicting performance on state 
standards tests has produced positive results, with the measures
producing generally strong correlations with state reading
tests and fairly good sensitivity, specificity, positive predictive
power, and negative predictive power. These conclusions
must be tempered by the fact that the technical adequacy of 
CBM reading measures for predicting performance on state 
tests may be affected by the nature of the state test, as seen in 
Stage and Jacobsen (2001). 

Aside from a need for further research on the generaliz- 
ability of the measures, two other issues have yet to be suffi- 
ciently addressed in the area of measure development. First, 
there is a need to further examine the relationship between 
reading aloud and reading comprehension at the individual 
level. Few studies focused on the individual level. One that did, 
Markell and Deno (1997), implied that large gains in reading- 
aloud scores might be necessary before gains in reading com- 
prehension could be assumed. There is a need to replicate this 
research and to examine the implications of the results for in- 
dividual progress monitoring. Relatedly, we think research on 
the existence of word callers is of interest. It is conceivable 
that a small group of students exists whose performance on the 
reading-aloud measures is, relatively speaking, much higher
than their performance on comprehension measures. 

A second issue relates to methods for linking measures. 
If word ID is to be used for beginning readers, reading aloud 
for primary-grade readers, and maze for intermediate and sec- 
ondary-school readers, how might the measures be linked to 
create a picture of growth across school years? The ability to 
link different measures across years would contribute to the 
development of a seamless and flexible system of progress 
monitoring. 

Effects of Text Materials 

Thus far, we have focused on various measures that can be 
used in CBM in reading. In this section, we turn our attention 
to the materials used to develop those measures. In 1994, Fuchs 
and Deno raised the question of whether instructionally use- 
ful performance assessment had to be based in the curricu- 
lum. The authors noted that although measures selected from 
the curriculum might have face validity, curriculum measures 
also introduced more error into progress measurement be- 
cause of variations in passage difficulty and student familiar- 
ity with passages. In addition, passages selected from within 
an instructional curriculum might limit generalizability to 
other reading curricula. 

A large number of studies were conducted beginning in 
the 1990s to address questions related to the materials used to 
develop CBM measures. The research fell into two broad cat- 
egories. The first addressed the question of curriculum source, 
that is, whether technical adequacy of the measures differed 
with the curriculum used to generate measures. Included in 
this category were studies comparing reading passages gen- 
erated from different curricula and studies comparing pas- 
sages generated from an instructional versus a “generic” or 
noninstructional curriculum. The second category of research
addressed the difficulty level of the materials chosen, specifically
whether measurement had to be done within the student’s
instructional level.

We note two points before describing the literature in 
this section. First, the research on materials has been con- 
ducted almost exclusively with reading-aloud measures. Few 
studies have examined maze or word ID measures. Second, 
there are a surprisingly large number of studies addressing the 
question of materials. Given the space limitations of
a review article, and the similarities in the pattern of results,
we report the results generally, referring to details of specific 
studies only when needed. 

Curriculum Effects 

Three general themes emerged from the research on the ef- 
fects of curriculum on CBM reading measures. First, level of 
performance differs significantly with curriculum source. 
Second, although technical adequacy does not vary with cur- 
riculum source, rates of growth may. Third, it is not necessary 
to match instructional and progress-monitoring material. 

With respect to the first theme — differences in levels of 
performance — results consistently reveal mean level differ- 
ences in scores on passages drawn from different curricula, 
beginning with Tindal, Marston, Deno, and Germann (1982). 
Studies have shown higher levels of performance on instruc- 
tional materials than on mainstream materials (Tindal, Flick, 
& Cole, 1992), on literature-based materials than on authentic 
materials (Hintze, Shapiro, Conte, & Basile, 1997), on basal 
materials than on literature-based materials (Bradley-Klug, 
Shapiro, Lutz, & DuPaul, 1998), and on generic materials than 
on basal materials (Powell-Smith & Bradley-Klug, 2001). In 
addition, higher levels of performance have been found on 
maze measures drawn from materials controlled for difficulty 
than on literature-based materials (Brown-Chidsey, Johnson, 
& Fernstrom, 2005). Mean level differences are important for
progress monitoring only insofar as scores are used to compare
students across classes, schools, or districts or to compare
a student’s performance at one point in time with performance
at a later point in time. For such uses, it is important to keep
the source of material constant. However, performance-level 
differences do not address the issue of technical adequacy of 
the measures as indicators of reading performance or growth. 
Measures may produce differences in levels of performance 
but be equally good indicators of general reading proficiency. 

The second theme to emerge from the curriculum mate- 
rials research does relate to technical adequacy. Results across 
several studies reveal that there are few differences in the 
technical adequacy of reading-aloud measures selected from 
different curricula. For example, Fuchs and Deno (1992) com- 
pared reading-aloud passages drawn from two published basal 
curriculum series and found no differences in the magnitude 
of correlations with the Woodcock Reading Mastery Test 
(WRMT; Woodcock, 1973) in Grades 1 through 6. Similarly, 
Hintze, Shapiro, Conte, and Basile (1997) found no differences
in alternate-form reliability or criterion-related validity
for passages selected from authentic and literature-based 
curricula and no differences in the passages for classifying 
students into groups. Hartman and Fuller (1997) found that 
test-retest reliability and criterion-related validity coefficients 
for passages selected from literature-based materials for stu- 
dents in Grades 1 through 3 were similar to those reported 
in the other research for passages selected from basal-series 
texts. Finally, Brown-Chidsey et al. (2005) found high corre- 
lations between performances on maze passages selected from 
controlled and literature-based material. 

Similar to the research on performance levels, research 
on the developmental growth rates (i.e., changes in scores 
across grade levels) produced by CBM reading measures
reveals few differences related to curriculum source (Fuchs &
Deno, 1992; Hintze et al., 1997). However, results from stud- 
ies of intraindividual growth rates have been mixed. Bradley- 
Klug, Shapiro, Lutz, and DuPaul (1998) found no differences 
in growth rates produced by literature-based and basal-series 
curricula for second-graders. Differences in growth for fifth- 
graders were not statistically significant, but growth rates 
were .84 words per week for literature-based and .32 for basal- 
series probes. Brown-Chidsey et al. (2005) found that maze 
probes created from controlled and literature-based passages 
both produced positive growth rates from fall to winter to 
spring measurements. The significance of the difference in 
growth rates was not tested. 

Other studies have found significant differences in growth 
rates. Hintze, Shapiro, and Lutz (1994) examined the growth 
rates produced by literature-based and basal-series curricula 
for two groups of third-grade students who were taught with 
either a literature-based or basal-series curriculum. Growth rates 
for the literature-based and basal-series students were -.35 
and -1 when measured with literature-based probes, but were 
.66 and 1.70 when measured with basal-series probes. In the 
Hintze et al. (1994) study, passages were not equated for dif- 
ficulty. Passage difficulty was equated in a subsequent study 
(Hintze & Shapiro, 1997) in which students in Grades 2 to 5 
who were instructed with literature-based or basal-series ma- 
terial were monitored with literature-based and basal-series 
passages. Again, results revealed differences in growth rates 
related to curriculum source, although the results were in- 
consistent across grades. Students in Grades 2 through 4 
achieved greater growth when monitored with the literature-based,
rather than with the basal-series, probes (a pattern
opposite that of Hintze et al. [1994], who found greater growth
with basal-series probes than with literature-based probes for
third-graders), whereas students in Grade 5 achieved greater 
growth with the basal-series probes (a pattern opposite that 
found by Bradley-Klug et al. [1998], who found greater growth 
rates on literature-based probes for fifth-grade students). 

It is difficult to determine why there is such inconsis- 
tency in results regarding growth rates. However, given the 
general pattern of inconsistency in the results (i.e., literature-based
probes sometimes producing higher and sometimes lower growth
rates), we hypothesize that factors other than curriculum source
may be contributing to differences in growth rates. For ex- 
ample, as will be discussed in a later section, it is quite diffi- 
cult to determine the equivalency of “parallel probes” used to 
monitor progress, even when the probes are drawn from the 
same curriculum and matched on readability level. Moreover, 
slope values can easily be affected by a bunching of particu- 
larly difficult or easy probes near the beginning or end of a 
progress-monitoring session. One way to address these issues 
is to remove potential effects of nonequivalence of passages 
by counterbalancing the order of passages across students so 
that students do not read passages in the same order. Another 
method is to use techniques other than readability to establish 
equivalence of the passages (a point we will discuss later). In 
either case, based on current research, we cautiously suggest 
that growth rates are not affected by curriculum source; how- 
ever, as with level of performance differences, we would still 
recommend consistency in progress monitoring with respect 
to curriculum source. 
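
As an illustration of the counterbalancing idea, the sketch below rotates a fixed set of probes so that each student starts at a different point in the sequence. This cyclic (Latin square style) rotation is only one possible scheme and is our assumption, not a procedure reported in the studies reviewed; the function name and the passage labels are hypothetical.

```python
def counterbalanced_orders(passages, n_students):
    """Give each student a rotated copy of the passage sequence.

    Student 0 reads the passages in the listed order, student 1 starts
    one passage later and wraps around, and so on. When n_students is a
    multiple of len(passages), every passage appears equally often in
    every serial position across students.
    """
    n_passages = len(passages)
    return [
        [passages[(student + position) % n_passages]
         for position in range(n_passages)]
        for student in range(n_students)
    ]


# Hypothetical example: three probes, three students.
orders = counterbalanced_orders(["A", "B", "C"], n_students=3)
for student, order in enumerate(orders, start=1):
    print(student, order)
```

With this kind of rotation, an unusually hard or easy probe does not fall at the same point in the monitoring period for every student, so it should be less likely to bias the group's average slope in one direction.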

The third and last theme emerging from the research on 
materials is related to the need to match monitoring material 
to the materials used in instruction. The aforementioned 
Hintze et al. (1994) and Hintze and Shapiro (1997) studies re- 
sulted in no interaction between the growth rates generated by 
materials selected from particular curricula and the curricula 
used for instruction. In other words, students taught using a 
literature-based series did not grow differently on literature- 
based probes than on basal-series probes. Other studies have 
produced similar results. Tindal et al. (1992) found no differences
in slope of improvement for 12 special education
students in Grades 2 through 5 when monitored on highly controlled
instructional materials or the general education basal
curriculum. Powell-Smith and Bradley-Klug (2001) and Riley-Heller,
Kelly-Vance, and Shriver (2005) found no differences
in growth rates between probes derived from the students’ in- 
structional material and generic probes for second-grade stu- 
dents. 

Difficulty Level 

In terms of difficulty level of material, there are two general 
issues to consider. The first is whether students must be mea- 
sured in instructional-level material or whether they can be 
measured in material outside their instructional level. The sec- 
ond is the importance of establishing equivalence of passage 
difficulty for repeated progress monitoring. 

Need to Measure at Instructional Level. Address- 
ing the issue of whether students must be measured with 
instructional-level material, Fuchs and Deno (1992) compared 
criterion-related validity and developmental growth rates pro- 
duced by materials of various difficulty levels. Results revealed 
no differences related to material difficulty in the magnitude 
of the correlations. Average correlations across difficulty lev- 
els with the WRMT (Woodcock, 1973) were .91, ranging from 
.89 to .93. Results also supported the use of a generic (in this
case, third-grade) passage for tracking growth across grade 
levels. Growth rates produced with a common third-grade 
passage were relatively stable and linear. However, Fuchs and 
Deno also found that developmental growth rates decreased 
as the difficulty of the material increased; thus, sixth-grade 
material produced growth rates that were less steep than those 
produced with third-grade material. 

Several studies have examined the influence of material 
difficulty on intraindividual growth rates. Shinn, Gleason, and 
Tindal (1989) monitored the reading-aloud progress of mildly 
handicapped students in Grades 3 to 8 who were randomly as- 
signed to one of two groups: reading material one grade level 
below and above instructional placement, or two and four 
grade levels above instructional placement. Results revealed 
comparable slopes for students monitored one level below 
(4.3 words per week) and above (3.7 words per week) instructional
placement level. Although the difference was not statistically
significant, slopes for two and four levels above instructional
placement did differ in magnitude (4.55 vs. 2.35 words per week),
supporting the earlier finding that an increase in difficulty leads
to flatter slopes. Standard errors of estimate (SEEs) did not
differ by difficulty level. 

Dunn and Eckert (2002) found no differences in slopes 
or SEEs for grade-level versus challenging (approximately 
one grade level above instructional level) material for second- 
or third-grade students. Hintze, Daly, and Shapiro (1998), how- 
ever, did find differences in the pattern of results for grade 
levels. Students in Grades 1 through 4 were monitored with 
material at and 1 year above grade level. No differences were 
seen in alternate-form reliability between materials, nor in the 
slopes obtained for students in Grades 3 and 4. However, for 
students in Grades 1 and 2, growth rates were higher on grade- 
level material than they were on material above grade level. 
Differences were especially notable for first-grade students. 
The authors surmised that students in the beginning stages of 
reading development experienced more fluency problems 
when the text was difficult than did students who were beyond 
the beginning stages of reading. 

One study examined the effects of difficulty level on the 
slopes produced by word identification probes. Fuchs, Tindal, 
and Deno (1981) compared three types of word lists: those 
generated from grade-level material covered throughout the 
year (grade-level, comprehensive), from grade-level material 
covered during the time of the study (grade-level, limited), 
and across-grade-level material (preprimer to Grade 4). Dif- 
ferences were found in the magnitudes of the slopes produced 
by the measures. The steepest slopes were produced by the 
grade-level, limited word lists (.49), followed by the grade- 
level, comprehensive lists (.20), and the across-grade lists 
(-.07).

Equivalence of Passages. A second issue related to diffi- 
culty level of CBM materials, and one that has received far less 
attention than other materials-related issues have, is the importance
of establishing the equivalence of the “parallel” passages
used for repeated progress monitoring. As one might expect,
CBM reading scores are sensitive to the difficulty level of
the passages. For example, each of the studies described in the
section above reported mean score differences on the reading-aloud
measure for passages of varying difficulty levels (e.g.,
Dunn & Eckert, 2002; Fuchs & Deno, 1992; Hintze et al.,
1998; Shinn et al., 1989). In one respect, this is a positive finding
and demonstrates the validity and sensitivity of the CBM
reading-aloud measures. On the other hand, with respect to 
ongoing progress monitoring, the finding is problematic. It 
implies that scores on repeated progress-monitoring measures 
will be affected by variation in passage difficulty (a finding 
confirmed by studies reviewed in a later section on growth). 

The importance of establishing equivalence in passage 
difficulty has been illustrated in two studies. In the first, Poncy, 
Skinner, and Axtell (2005) examined the effects of passage 
variability on reading-aloud scores. Participants were third- 
graders who read 20 grade-level passages with readabilities 
ranging from 2.8 to 3.1 grade level. Passages were presented 
to students in random order over a period of 4 days. Analyses 
revealed that 81% of the variance in students’ scores was due 
to student skill, 10% was due to passage variability, and 9% 
was unaccounted for. When the difficulty of the passages was
controlled on the basis of students’ average scores, variance due to
student skill increased and variance due to passage difficulty
decreased.
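
A decomposition of this kind can be sketched with a simple students-by-passages table in which every student reads every passage once. The example below estimates variance components from the mean squares of a two-way layout without replication; whether this matches the analysis used in the study described above is an assumption on our part, and the function name and the scores are hypothetical. The sketch also assumes that the scores are not all identical.

```python
def variance_components(scores):
    """Estimate proportions of score variance due to students and passages.

    scores[s][p] is the number of words read correctly by student s on
    passage p. Every student is assumed to have read every passage once,
    so the student-by-passage interaction cannot be separated from error
    and both are reported together as "residual".
    """
    n_students = len(scores)
    n_passages = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n_students * n_passages)
    student_means = [sum(row) / n_passages for row in scores]
    passage_means = [
        sum(row[p] for row in scores) / n_students for p in range(n_passages)
    ]

    # Sums of squares for the two main effects and the remainder.
    ss_student = n_passages * sum((m - grand_mean) ** 2 for m in student_means)
    ss_passage = n_students * sum((m - grand_mean) ** 2 for m in passage_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_residual = ss_total - ss_student - ss_passage

    ms_student = ss_student / (n_students - 1)
    ms_passage = ss_passage / (n_passages - 1)
    ms_residual = ss_residual / ((n_students - 1) * (n_passages - 1))

    # Expected-mean-square solutions; negative estimates are set to zero.
    var_student = max((ms_student - ms_residual) / n_passages, 0.0)
    var_passage = max((ms_passage - ms_residual) / n_students, 0.0)
    var_residual = max(ms_residual, 0.0)
    total = var_student + var_passage + var_residual
    return {
        "student": var_student / total,
        "passage": var_passage / total,
        "residual": var_residual / total,
    }


# Hypothetical scores: three students (rows) by two passages (columns).
example = [[75, 85], [95, 105], [115, 125]]
proportions = variance_components(example)
print(proportions)
```

In this artificial example the second passage is uniformly 10 words easier than the first, so all variance is attributed to students and passages and none to the residual.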

In the second study, Hintze and Christ (2004) compared 
grade-level material controlled for difficulty level with ran- 
domly selected materials from graded readers for students in 
Grades 2 to 5. Students were administered both controlled and 
uncontrolled reading-aloud passages over the course of 11 
weeks. The results indicated that estimates of both SEE and 
the standard error of the slope (SEb) were smaller when pas- 
sages were controlled for difficulty than when they were not. 
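
For readers less familiar with these indices, the sketch below shows how a growth rate (slope), the SEE, and the SEb could be obtained for one student, assuming an ordinary least squares regression of words read correctly on week of measurement. The use of ordinary least squares, the function name, and the five data points are our assumptions for illustration and are not drawn from the studies reviewed.

```python
import math


def growth_statistics(weeks, scores):
    """Return (slope, SEE, SEb) from a least squares line of score on week.

    slope: estimated gain in words read correctly per week
    SEE:   standard error of estimate (spread of scores around the line)
    SEb:   standard error of the slope
    Requires at least three data points and weeks that are not all equal.
    """
    n = len(weeks)
    mean_x = sum(weeks) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(weeks, scores)
    )
    see = math.sqrt(ss_residual / (n - 2))
    seb = see / math.sqrt(sxx)
    return slope, see, seb


# Hypothetical weekly scores for one student (not from any cited study).
slope, see, seb = growth_statistics([1, 2, 3, 4, 5], [50, 53, 54, 58, 60])
print(f"slope={slope:.2f} words/week, SEE={see:.3f}, SEb={seb:.3f}")
```

Because SEb is the SEE divided by the square root of the spread in measurement occasions, passages that add noise around the trend line (a larger SEE) would also make the estimated growth rate less precise.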

The results of Poncy et al. (2005) and of Hintze and Christ 
(2004) emphasize the need to establish passage equivalence. 
Yet, creating parallel passages is not as easy as it may seem. 
The most common method for establishing equivalence of CBM 
passages has been to examine the readability levels of the pas- 
sages via the use of common readability formulas. However, 
Ardoin, Suldo, Witt, Aldrich, and McDonald (2005) found 
only modest relationships between the reading levels assigned 
to passages via readability formulas and the number of words 
read correctly (WRC) in 1 min from those passages for third- 
graders. The readability formula that produced the highest 
and most reliable mean associations with WRC was Forecast 
(Sticht, 1973), a formula that has not been used in the CBM 
research. In a component analysis, Ardoin et al. (2005) found 
that the two components significantly related to WRC were 
syllables per 100 words and words not on the Dale-Chall list 
of 3,000 words. Results also revealed inconsistencies in the 
levels assigned to passages among various readability formulas. 

Other studies have demonstrated low correlations (ranging
from rs = -.08 to .43) between scores on CBM reading-aloud
and readability formula levels (Bradley-Klug et al., 1998;
Compton, Appleton, & Hosp, 2004; Hintze et al., 1994; Powell- 
Smith & Bradley-Klug, 2001) and low correlations among 
various readability formulas (r = .28; Compton et al., 2004). 
In addition, Compton et al. (2004) found that certain passage 
components were related to WRC in 1 min, including the num- 
ber of high-frequency and decodable words, the number of 
multisyllabic words (negatively related), and sentence length. 

Given the problems with readability formulas for deter- 
mining passage difficulty, how can one determine the equiv- 
alence of passages used for CBM progress monitoring? Ardoin 
et al. (2005) proposed a process whereby a set of passages is 
selected based on specific components found to relate to read- 
ing fluency, such as those identified in Ardoin et al. (2005) 
and Compton et al. (2004). Selected passages are then field- 
tested with a large number of students. In this system, only 
passages that fall within 1 standard deviation of the mean 
WRC would be selected for use in CBM progress monitoring. 
The effects of such an approach on the stability of the growth 
rates produced by reading-aloud passages have yet to be ex- 
amined. 
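
The screening step in this proposal can be sketched in a few lines of
code. The following is a minimal illustration only, assuming that
field-test data are available as a mapping from each passage to the WRC
scores of the students who read it. The function name, the passage
labels, and the scores are hypothetical and are not taken from Ardoin
et al. (2005). The sketch also assumes one possible reading of the
criterion, namely that a passage is retained when its mean WRC lies
within 1 standard deviation (computed across passage means) of the
grand mean.

```python
from statistics import mean, stdev


def select_equivalent_passages(field_test, n_sd=1.0):
    """Return labels of passages whose mean WRC is within n_sd SDs of the grand mean.

    field_test: dict mapping a passage label to a list of WRC scores.
    Assumption: the SD is computed across the passage means.
    """
    passage_means = {label: mean(scores) for label, scores in field_test.items()}
    means = list(passage_means.values())
    grand_mean = mean(means)
    sd = stdev(means)  # sample SD; requires at least two passages
    return sorted(
        label
        for label, m in passage_means.items()
        if abs(m - grand_mean) <= n_sd * sd
    )


# Hypothetical field-test data: passage C is unusually hard, passage E unusually easy.
field_test = {
    "A": [88, 92, 90],
    "B": [91, 95, 93],
    "C": [60, 64, 62],
    "D": [89, 90, 94],
    "E": [118, 122, 120],
}
print(select_equivalent_passages(field_test))
```

With these invented data, the two outlying passages are dropped and the
remaining three are retained for progress monitoring.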

Summary and Discussion 

The research on curriculum source supports a robustness in 
the CBM measures: Even when developed from different cur- 
riculum sources, the measures seem to function consistently. 
Moreover, it does not appear necessary to develop CBM probes 
from material in which the student is being instructed. These 
conclusions are good news in terms of the development of a 
seamless and flexible system of progress monitoring. They 
imply that CBM progress monitoring using reading-aloud 
measures can be used across various curricula and instruc- 
tional approaches. However, the research related to difficulty 
level poses some restrictions to the general robustness of 
CBM materials. 

With respect to the issue of whether students need to be 
measured with instructional-level material, the literature in- 
dicates that CBM reading-aloud measures are fairly flexible 
with regard to difficulty level: Students can be measured with 
material that is easier or more difficult than their instructional 
level, and the technical adequacy of the measures is not af- 
fected. However, there are limits to this flexibility. Rates of 
growth may be affected if material is too difficult (e.g., two 
to three levels above instructional level). Further, there is 
some indication that beginning readers may be more affected 
by difficulty level than more advanced readers are. In terms 
of word identification, results reveal that the more closely the 
measure is tied to the instruction, the more sensitive it is to 
growth. Of course, this sensitivity must be balanced with gen- 
eralizability of the results. If words are selected that cover 
only 1 month of instruction, students are likely to hit a ceil- 
ing in their performance in a relatively short amount of time. 

The issue of difficulty as it relates to intraindividual 
growth monitoring is of greater concern. Generally, results
emphasize the need to establish passage equivalence for CBM
progress monitoring, especially if the measures are to be 
used as a part of a decision-making process that carries im- 
portant social consequences. The issue of passage equivalence 
is perhaps less of a concern if CBM is used by a classroom 
teacher to monitor student progress and evaluate instructional 
programs. Given the time-consuming nature of the type of 
approach suggested by Ardoin et al. (2005), and the positive 
treatment validity results reported in Stecker et al. (2005), use 
of passages selected from controlled sources, such as basal- 
reading series, is probably sufficient for such classroom use. 
However, if CBM is to be used as part of a school- or dis- 
trictwide decision-making process, or if it is to be used as part 
of an eligibility decision-making process, we feel that it is 
necessary to establish equivalence of passage difficulty for 
progress monitoring by adapting a process similar to that sug- 
gested by Ardoin et al. (2005). 

Issues About Measuring Growth 

We already have discussed in part some factors related to 
growth. For example, the amount of growth produced by 
CBM measures may vary with the curriculum source or dif- 
ficulty level of the passages, and error in the production of 
growth rates is reduced when the difficulty of the passages 
used for progress monitoring is controlled. Two additional is- 
sues specific to growth have not yet been addressed: (a) How 
reliable or trustworthy are the rates of growth produced by 
CBM measures? (b) Can standards for growth be determined, 
and do they differ for students by performance or grade level? 
As we did in the section on the effects of text materials, we 
caution the reader that the majority of research conducted in 
this area has been with the reading-aloud measure, with little 
attention devoted to word ID or to maze selection. 

Reliability of Growth Rates 

The research on determining the trustworthiness of the growth 
rates produced by CBM measures includes examination of the 
dependability of single CBM scores and the reliability or 
accuracy of the slopes produced by multiple, repeated scores. 
With regard to single scores, studies have examined the amount 
and sources of error surrounding single CBM scores. Christ 
and Silberglitt (in press) calculated typical standard error of
measurement (SEM) values across students in Grades 1 to 5,
taking into consideration various levels of measurement reli- 
ability and sample variability. Values were calculated for data 
collected in fall, winter, and spring (three measurements per 
data-collection period) across a period of 8 years for 8,200 
students. Results reveal that the median SEM across grades 
and conditions was 10 words read correctly, with a range of 5 
to 15. Reliability, grade, and sample diversity affected SEMs,
with smaller SEMs associated with higher levels of reliability,
lower grade levels, and less sample variability. The authors
suggested that the use of standard SEMs could aid in the
interpretation of CBM reading aloud data.
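
For readers less familiar with the statistic, the way SEM depends on
reliability and sample variability can be illustrated with the
classical test theory formula SEM = SD x sqrt(1 - r), where SD is the
standard deviation of scores in the sample and r is the reliability
coefficient. The sketch below assumes that this standard formula
applies. The function name and the input values are hypothetical and
are not taken from Christ and Silberglitt (in press).

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical combinations of sample SD (in WRC) and reliability.
for sd in (30, 40):
    for reliability in (0.85, 0.95):
        sem = standard_error_of_measurement(sd, reliability)
        print(f"SD = {sd}, reliability = {reliability:.2f}: SEM = {sem:.1f} WRC")
```

Under the formula, SEM falls as reliability rises and as sample
variability shrinks, which is the direction of the pattern described
above.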

Sources of error associated with CBM scores have been 
examined recently via the application of G theory, or gener- 
alizability theory, to CBM reading aloud. G theory is designed 
to assess the dependability of behavioral measurements by 
specifying portions of error that can be accounted for by var- 
ious situational variables under which measurements are taken 
(Cronbach, Gleser, Nanda, & Rajaratnam, 1972). Through G 
theory, “portions of variance that in classic test score theory 
are simply attributed to random error” (Hintze, Owen, Sha- 
piro, & Daly, 2000, p. 53) are identified and explained. Re- 
sults of a series of studies applying G theory have revealed 
consistent support for the dependability of the CBM reading 
measures for inter- and intraindividual decision making at the 
elementary school level, with the majority of variance in CBM 
scores accounted for by individual variation and grade level 
(Hintze et al., 2000) or individual variation and group mem- 
bership (Brown-Chidsey, Davis, & Maya, 2003; Hintze & 
Pelle Petitte, 2001). 

Although dependability of individual data points con- 
tributes to the dependability of the slope, it does not guarantee 
it. Other factors must be considered with regard to slope. 
Several approaches have been taken to evaluate the trustwor- 
thiness or the reliability of the slopes produced by CBM mea- 
sures. In early CBM studies, the focus was on whether CBM 
reading measures were sensitive to change in performance. 
For example, Marston and Deno (1982) and Marston, Deno, 
and Tindal (1983) compared change in scores on reading-aloud 
and word ID measures to changes in scores for other reading 
measures, such as standardized achievement tests or basal- 
series tests. Results revealed that the CBM measures were 
more sensitive to growth over a short period of time than were 
these other measures. 

Later studies focused more closely on the statistical prop- 
erties of the growth rates produced by repeated CBM data. 
Skiba, Deno, Marston, and Wesson (1986) examined slopes 
for CBM reading-aloud data collected three to five times per 
week over a 6-month period for 67 students with disabilities 
(Grades 1-7). The mean weekly increase was 1.55 words read per
minute, and the rate of increase tended to have an inverse
relationship to grade level (a pattern that has been replicated
in other research that will be reviewed later). The mean
SEE across grade levels was 10.17, with a range of 8.45 in 
Grade 1 to 1.56 in Grade 4. 

In 1992, Fuchs and Fuchs examined the ratio of the slope 
value to the standard error of estimate. SEE indexes the de- 
gree of intraindividual instability in the CBM data. A large 
SEE in proportion to the slope makes the graphs difficult to 
interpret. Fuchs and Fuchs (1992) found that maze had a rel- 
atively small SEE in proportion to the slope when compared 
to the other measures that might serve as alternatives for read- 
ing aloud. 

Shinn et al. (1989) described a desirable slope as one that 
would be (a) easy to compute and interpret, (b) accurate in the
sense of not producing systematic over- or underpredictions
of performance, and (c) precise in the sense that individual er- 
rors of prediction are minimized. In a series of studies, these 
characteristics were compared for different methods for slope 
calculation (Good & Shinn, 1990; Parker & Tindal, 1992; 
Shinn, Good, & Stein, 1989). Results of these studies sup- 
ported the use of ordinary least squares (OLS) for calculating 
slope compared to other methods (see Note 3). OLS was not 
the easiest method for calculating slope, but it did not sys- 
tematically over- or underpredict future performance given a 
relatively small number (e.g., 10) of data points, and it mini- 
mized individual errors of prediction. Results of the slope 
studies revealed negatively accelerating growth rates within 
the academic year. 
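
To make the quantities in these studies concrete, the following minimal
sketch fits an OLS line to one student's progress-monitoring scores and
reports the slope (WRC gained per week) together with the SEE (the
spread of the data points around the fitted line, computed here with
n - 2 degrees of freedom). The function name and the scores are
hypothetical and are not drawn from any of the studies reviewed.

```python
import math


def ols_slope_and_see(weeks, scores):
    """Fit scores = intercept + slope * weeks by ordinary least squares.

    Returns (slope, intercept, see), where see is the standard error of
    estimate: sqrt(sum of squared residuals / (n - 2)).
    """
    n = len(weeks)
    if n != len(scores) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(weeks) / n
    y_bar = sum(scores) / n
    sxx = sum((x - x_bar) ** 2 for x in weeks)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weeks, scores))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(weeks, scores))
    see = math.sqrt(sse / (n - 2))
    return slope, intercept, see


# Hypothetical weekly WRC scores for one student over 10 weeks.
weeks = list(range(1, 11))
scores = [52, 49, 55, 58, 54, 60, 57, 63, 61, 66]
slope, intercept, see = ols_slope_and_see(weeks, scores)
print(f"slope = {slope:.2f} WRC per week, SEE = {see:.2f}")
```

The ratio of SEE to slope examined by Fuchs and Fuchs (1992) can be
read directly from these two values.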

Dunn and Eckert (2002) discussed limitations of studies 
of slope, stating that the studies compared line-fitting meth- 
ods to each other rather than to an absolute standard of tech- 
nical adequacy and focused primarily on predicting later data 
from earlier data in the same data set. Dunn and Eckert pro- 
posed examination of the correlations between words read 
correctly and time (school day) as an indicator of accuracy of 
the slope. The square of the correlation coefficient would re- 
flect the amount of variability in the words read correctly due 
to time. A higher percentage would indicate a more accurate 
slope line. Dunn and Eckert compared slopes for grade-level 
and above-grade-level material (see description of the study 
in the Materials section). Results revealed median correlation
coefficients of .15 and .14, respectively; squared, these values
indicate that only about 2% of the variability in the number of
words read correctly could be attributed to time.

Both Hintze and Christ (2004) and Christ (2006) con- 
sidered SEb in their investigations of slope reliability. Hintze 
and Christ (2004; reviewed earlier) found that controlling the 
difficulty of progress-monitoring material reduced both SEE 
and SEb. Christ (2006) examined the likely magnitudes of 
SEb for different values of SEE and for different durations 
of progress monitoring. Results revealed that the longer the 
progress-monitoring duration, the smaller the SEE and the 
smaller the SEbs (9.19 for 2 weeks compared to .42 for 15 
weeks). If one assumed a moderate amount of SEE, 9 to 10 
data points were needed to reduce SEbs to levels below 1. Ten
to 12 data points reduced SEbs to between .59 and .78. Christ 
(2006) discussed the importance of controlling testing condi- 
tions and passage difficulty to reduce SEE. 
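
The dependence of SEb on the duration of monitoring follows from the
standard OLS expression SEb = SEE / sqrt(sum of (t - mean t)^2), where
t is the time of each measurement. The sketch below holds SEE fixed and
assumes a schedule of two evenly spaced measurements per week. The
function names, the schedule, and the SEE value are assumptions made
for illustration and are not details confirmed from Christ (2006).

```python
import math


def standard_error_of_slope(see, times):
    """Standard OLS formula: SEb = SEE / sqrt(sum((t - t_bar)^2))."""
    n = len(times)
    if n < 2:
        raise ValueError("need at least 2 measurement occasions")
    t_bar = sum(times) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("measurement times must not all be identical")
    return see / math.sqrt(sxx)


def twice_weekly_schedule(n_weeks):
    """Hypothetical schedule: measurement times in weeks, two per week."""
    return [k * 0.5 for k in range(2 * n_weeks)]


ASSUMED_SEE = 10.0  # assumed value, in WRC

for n_weeks in (2, 5, 10, 15):
    seb = standard_error_of_slope(ASSUMED_SEE, twice_weekly_schedule(n_weeks))
    print(f"{n_weeks:2d} weeks ({2 * n_weeks} data points): SEb = {seb:.2f}")
```

Because the denominator grows rapidly as measurement occasions spread
over a longer period, SEb shrinks with duration even when SEE itself
does not change.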

Most studies of slope have focused on the stability of the 
slopes, but little attention has been paid to the validity of the 
slopes generated by CBM data. We use validity in this sense 
to refer to the degree to which slope values predict perfor- 
mance on external criterion measures. With the development 
and ease of use of new statistical techniques, investigations 
of the validity of slopes have become easier to accomplish. 
For example, Shin, Deno, and Espin (2000) used Hierarchi- 
cal Linear Modeling (HLM) to examine growth rates on maze 
selection measures for second-grade students over the course 
of a year. Results revealed that maze selection sensitively
reflected improvement in student performance across the year
and reflected interindividual differences in growth rates. In 
addition, growth on maze was positively and significantly re- 
lated to later performance on a standardized reading achieve- 
ment test. Ticha et al. (2007) and Espin et al. (2007) used 
HLM to examine growth rates produced by maze selection for 
middle-school students. They found that growth on a maze se- 
lection task was significantly related to performance on a state 
standards test and to growth on a standardized achievement 
test. Finally, Speece and Ritchey (2005) used growth-curve 
analysis to examine the growth rates produced by reading 
aloud for a sample of at-risk first-graders. Results revealed 
that rate of growth on the reading-aloud measures in first 
grade predicted rate of growth in second grade, as well as end- 
of-year performance in second grade. 

Standard Rates of Growth 

A question often posed by practitioners is how much growth 
one can expect on CBM measures and whether growth ex- 
pectations differ by age or performance levels. Several stud- 
ies have examined these standard rates of growth. Marston, 
Lowry, Deno, and Mirkin (1981) examined growth rates from 
fall to winter to spring for students in Grades 1 through 6 for 
both reading aloud and word ID. Fifty-eight randomly se- 
lected general education students participated. Both word ID 
and reading-aloud measures reflected growth over time and 
exhibited steep, continual, linear growth. Dramatic changes were
observed in the earlier grades, with less dramatic changes in 
the upper grades, a result seen in other studies (e.g., Deno et
al., 1982; MacMillan, 2000). Hasbrouck and Tindal (1992)
published CBM norms for reading aloud on the basis of data 
collected over a period of 9 years from 7,000 to 9,000 stu- 
dents in Grades 2 through 5. They reported normative perfor- 
mance levels by time of year (fall, winter, spring), grade, and 
percentile level within grade. 

The limitation of grade-level standards is that they do 
not take into account an individual’s beginning level of per- 
formance. Students who are low performing may never reach 
grade-level standards, even when they are improving (Fuchs, 
Fuchs, Hamlett, Walz, & Germann, 1993). Fuchs et al. (1993) 
addressed intraindividual norms for weekly growth rates on 
CBM reading-aloud and maze measures. Participants were two 
samples of students in Grades 1 through 6. During the first 
year of the study, students (N = 117) were measured in grade-
level material once each week using reading aloud. During the
second year, students (N = 257) were measured in grade-level
material at least monthly using a computerized maze selec- 
tion program. Results revealed that for the majority of stu- 
dents, a linear model fit the growth data for both reading aloud 
and maze, but for a proportion of the students, a quadratic 
model fit the data. For most of these cases, a slightly negative 
pattern of growth was found. Weekly growth rates were cal- 
culated by grade level. Results revealed differences in growth 
rates as a function of grade for reading aloud but not for maze. 


As with earlier research, the magnitude of slopes was found 
to decrease with an increase in grade. However, similar to the 
findings of Jenkins and Jewell (1993), no such relationship 
was found for maze. Fuchs et al. (1993) presented what they 
termed realistic and ambitious standards for weekly growth. 
For reading aloud, these standards were, by grade level, 2 and 
3 words per week (Grade 1), 1.5 and 2.0 (Grade 2), 1.0 and 
1.5 (Grade 3), .85 and 1.1 (Grade 4), .5 and .8 (Grade 5), and 
.3 and .65 (Grade 6). For maze, realistic and ambitious stan- 
dards for growth remained the same across grades and were 
.39 and .84 word choices per week, respectively. 
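
One way such standards might be applied is to project an end-of-period
goal from a student's baseline score. The following sketch encodes the
reading-aloud standards listed above and computes goal = baseline +
weekly rate x number of weeks. The goal-setting rule, the names used,
and the example student are assumptions made for illustration and are
not procedures described by Fuchs et al. (1993).

```python
# Weekly growth standards for reading aloud (words per week), by grade,
# as listed in the text: (realistic, ambitious).
READING_ALOUD_STANDARDS = {
    1: (2.0, 3.0),
    2: (1.5, 2.0),
    3: (1.0, 1.5),
    4: (0.85, 1.1),
    5: (0.5, 0.8),
    6: (0.3, 0.65),
}


def projected_goal(baseline_wrc, grade, weeks, ambitious=False):
    """Project a goal score as baseline + weekly rate * weeks (assumed rule)."""
    realistic_rate, ambitious_rate = READING_ALOUD_STANDARDS[grade]
    rate = ambitious_rate if ambitious else realistic_rate
    return baseline_wrc + rate * weeks


# Hypothetical third-grader reading 60 WRC at baseline, monitored for 30 weeks.
print(projected_goal(60, grade=3, weeks=30))
print(projected_goal(60, grade=3, weeks=30, ambitious=True))
```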

Deno, Fuchs, Marston, and Shin (2001) established aca- 
demic growth standards for students in general and special ed- 
ucation under typical and effective instructional practices. 
Growth rates under typical conditions were generated using 
extant databases from four educational agencies across the 
country. Growth rates under effective instructional conditions 
were generated by combining data across studies in which 
instructional practices had been implemented and shown to 
be effective. As with previous research, results revealed the 
greatest growth in the early grades, with a decrease in growth 
rates with age. Typical growth rates for general and special 
education by grade level were 1.8 and .83 (Grade 1), 1.66 and 
.57 (Grade 2), 1.18 and .58 (Grade 3), 1.01 and .58 (Grade 4), 
.58 and .58 (Grade 5), and .66 and .62 (Grade 6). Large dif- 
ferences existed between the growth rates of general and spe- 
cial education students up until Grade 4. In the second part of 
the study, Deno et al. (2001) examined growth rates for stu- 
dents with learning disabilities who had received effective in- 
structional treatments across five different studies. Growth 
rates for both reading aloud and maze were reported. Growth 
rates for reading aloud ranged from .83 to 2.10 words per 
week. Growth rates for maze ranged from .56 to .70 words 
per week. With regard to the reading-aloud data, the results 
were striking in the sense that under effective instructional 
conditions, students with LD exhibited growth rates close to 
the typical growth rates for general education students seen in 
the first part of the study. 

Summary and Discussion 

Research on the technical adequacy of the growth rates pro- 
duced by CBM measures supports the dependability of single 
measures; however, results are more mixed with regard to 
slope. Slopes have been found to predict future CBM perfor- 
mance and have also been found to predict performance and 
progress on external measures. However, concern exists about 
the variability of the data points around the slope and the vari- 
ability of the slope values themselves. Concern also exists 
about establishing equivalence of passages used for progress 
monitoring. More research is needed on the technical proper- 
ties of the slope values produced by various CBM reading 
measures, especially as they relate to the use of CBM within 
the framework of high-stakes decisions, such as those in- 
volved in eligibility determination. Future research should 
also examine the relative effects of exceptionally high or low
data points at various time points on slope values, and practi- 
tioners’ abilities to interpret slope values and apply them to 
instructional decision making. 

Although studies of standard growth rates make an im- 
portant contribution to the CBM literature, the results must be 
viewed with caution. First, as acknowledged by the researchers, 
none of the studies employed a nationally representative data 
set. Second, as discussed earlier in this review, the growth 
rates obtained in these studies may have been affected by the 
materials in the studies, specifically the equivalence of the 
passages used to establish the growth rates. To illustrate this 
point, perusal of the various studies included in this review 
reveals growth rates for general education students in Grade 4
ranging from .24 to 2.69 words per week. Similar diversity in 
growth rates can be seen at the other grade levels. It may be 
important to consider establishing nationally standard growth 
rates by creating a standard set of passages that are controlled 
for difficulty and measuring a nationally representative sam- 
ple of students on a weekly basis. In addition, following the 
lead set by Fuchs et al. (1993), it seems important to consider 
both typical and ambitious growth standards. 

Conclusion 

We begin our Conclusion section with a return to the question 
posed at the beginning of the article: Are CBM measures 
“valid enough” for the purposes for which they are used? We 
believe that the data have supported the validity of the CBM 
reading-aloud measure for use by classroom teachers as an
indicator of the performance and progress in reading for
elementary school students in Grades 2 to 5. The measures have
been shown to relate to a variety of criterion measures across 
a multitude of studies conducted over many years with dif- 
ferent participants, methods, materials, and researchers. In ad- 
dition, the theoretical basis for the strength of reading aloud 
as an indicator of general reading proficiency has been sup- 
ported, and the measures have been shown to be dependable. 

Despite the positive support for the reading-aloud mea- 
sures, we offer some words of caution about the CBM reading 
measures in general. First, with respect to reading aloud, the 
measures have not always been found to be strongly related to 
criterion measures. Although correlations are consistently in 
the .70s or above, there are studies that have resulted in
correlations in the .40s between reading aloud and criterion variables.
A closer look at the reasons for these exceptions is in order. 
Research has suggested that the relationship between reading 
aloud and criterion variables may decrease as students get 
older; however, other factors may also influence the relation- 
ship, such as the influence of various passage characteristics 
on the relation between reading aloud and criterion variables. 

Second, the overwhelming majority of research in CBM 
reading has been done with reading aloud, and with samples 
of students in Grades 2 through 5. The question of the validity
of the measures for younger (K-Grade 1) and older (Grades
6-12) students, and for students of diverse backgrounds, is 
still open for examination. With regard to student age, research 
suggests that a word-ID measure may be more appropriate 
than reading aloud would be for younger, beginning readers, 
and that maze may be more appropriate for older, middle 
school students. With regard to students with diverse needs, 
correlations between the CBM measures and criterion vari- 
ables have generally been positive across various groups, but 
there is evidence that CBM reading aloud overestimates the 
performance of African American and EL students, which 
could result in underidentification for services. For students 
with sensory disabilities, too few studies have been conducted 
to draw conclusions. 

Finally, we raise the issue of the uses of CBM measures. 
CBM measures are no longer simply a means for special ed- 
ucation teachers to evaluate the effects of instructional pro- 
grams. The use of CBM measures has shifted from that of 
monitoring the progress of students in special education to use 
in high-stakes decisions that carry important social conse- 
quences. This shift creates a new set of standards for the mea- 
sures. The validity of the measures for these new purposes has 
yet to be established. There is little known, relatively speak- 
ing, about the technical characteristics of the slopes produced 
by CBM measures. For example, what is the best method for 
determining validity and reliability of the slope? How many 
data points are needed to obtain a reliable and valid slope? 
Does this number differ with the age of the student, the ma- 
terial used, or the equivalency of “parallel” passages? How 
do we establish parallel passages? There is also little known 
about normative or ambitious rates of performance and growth. 
For example, how much growth should be expected from 
students at different age and performance levels? Should na- 
tional norms be developed? If so, must standard sets of mate- 
rials be developed for this purpose? Finally, little is known 
about teachers’ understanding of CBM progress data and the 
thinking processes teachers use as they interpret and use data 
for decision making. For example, how accurate are teachers 
at interpreting CBM data? How long does it take them to learn 
how to interpret and use CBM data? How easy is it for teach- 
ers to connect progress-monitoring data to instructional deci- 
sions, and are there methods to enhance their ability to tie data 
and instruction together? We believe that these areas must be
explored if CBM is to be used as a part of a decision-making
process related to determining the need for special education 
services. 

Despite these words of caution, we believe that the flex- 
ibility and durability of CBM in reading across different mea- 
sures, materials, settings, students, and situations is notable. 
This flexibility and durability provide the basis for consider- 
ing the development of a seamless and flexible system of 
progress monitoring that could be used across students of var- 
ious ages and performance levels. Such a system might allow 
one to follow the progress of a student from kindergarten to 
Grade 12, using the same measures and materials or linking 
measures and materials. Many of the issues raised above (e.g.,
establishment of normative levels and growth rates using standard
material) would need to be addressed before such a system
could be realized, but the research on the development of
CBM measures in reading is at a point where development of 
such a seamless and flexible system is conceivable. 

NOTES

1. These measures can be found under various names in the literature.
Reading aloud is often referred to as oral reading fluency
and oral reading. Maze selection is referred to as maze and
multiple-choice cloze. Word identification is also referred to as
isolated word reading, word list reading, and word identification 
fluency. We have chosen to use reading aloud, maze selection, 
and word identification consistently throughout the article be- 
cause these terms represent the behavior that the student performs 
on each CBM measurement task; that is, students read aloud, se- 
lect maze choices, or identify words. 

2. Hintze and Silberglitt (2005) and Silberglitt and Hintze (2005) 
examined various methods for determining cut scores using the 
CBM measures. Results supported the use of logistic regression 
either alone or in combination with Receiver Operating Charac- 
teristics (ROC) curve analysis. Given the fact that using logistic 
regression alone is easier than using it with ROC and given the 
similarities in the pattern of results between the two approaches, 
we report results here for logistic regression analyses only. 

3. Although the research supported the use of OLS for calculating 
slope, this is not a method that can be easily calculated by hand 
by teachers. Parker and Tindal (1992) proposed an alternative, 
Tukey I, that had good statistical properties and could be calcu- 
lated by hand. To our knowledge, this method has never caught 
on in either CBM research or practice. In research, the over- 
whelming majority of studies calculate slope using OLS. 